Systems and methods for managing and analyzing faults in computer networks

ABSTRACT

A system (100) for analyzing a fault includes a fault object factory (110) constructed and arranged to receive fault data and create a fault object (112), and a fault diagnosis engine (101) constructed and arranged to perform root cause analysis of the fault object. The system may further include a fault detector (130) constructed and arranged to detect the fault data in a monitored entity, a fault repository (140) constructed and arranged to store and access the fault object, and a fault handler (150) constructed and arranged to be triggered by the fault diagnosis engine to analyze the fault object.

[0001] This application claims priority from U.S. Provisional Application 60/202,296, entitled “Construction of a Very Rich, Multi-layer Topological Model of a Computer Network for Purposes of Fault Diagnosis,” filed on May 5, 2000, and claims priority from U.S. Provisional Application 60/202,299, entitled “A method for diagnosing faults in large multilayered environments guided by path and dependency analysis of the modeled system,” filed on May 5, 2000, and claims priority from U.S. Provisional Application 60/202,298, filed on May 5, 2000, entitled “Method and apparatus for performing integrated computer network, system, and application fault management,” all of which are incorporated by reference in their entireties.

GENERAL DESCRIPTION

[0002] The present invention relates to a fault management and diagnosis system with a generic, easily extensible architecture.

[0003] The construction of computer networks started on a large scale in the 1970s. Computer networks link personal computers, workstations, servers, storage devices, printers and other devices. Historically, wide area computer networks (WANs) have enabled communications across large geographic areas, and local area networks (LANs) have enabled communications at individual locations. Both WANs and LANs have enabled sharing of network applications such as electronic mail, file transfer, host access and shared databases. Furthermore, WANs and LANs have enabled efficient transfer of information and sharing of resources, which in turn increased user productivity. Clearly, communications networks have become vitally important for businesses and individuals.

[0004] Communications networks usually transmit digital data in frames or packets created according to predefined protocols that define their format. Data frames include headers (located at the beginning and containing addresses), footers (located at the end of the frames), and data fields that include the transmitted data bits (payload). Data frames may have a fixed or variable length, depending on the protocol or network type used.

[0005] A communications network transmits data from one end station (e.g., a computer, workstation, or server) to another using a hierarchy of protocol layers (i.e., layers that are hierarchically stacked). In the communication process, each layer in the source communicates with the corresponding layer in the destination in accordance with a protocol defining the rules of communication. This is actually achieved by transferring information down from one layer to another across the layer stack, transmitting across a communication medium, and then transferring information back up the successive protocol layers on the other end. To facilitate better understanding, however, one can visualize a protocol layer communicating with its counterparts at the same layer level.

[0006] The Open Systems Interconnection (OSI) model has seven layers that define the rules for transferring information between the stations. A physical layer (Layer 1) is responsible for the transmission of bit streams across a particular physical transmission medium. This layer involves a connection between two endpoints allowing electrical signals to be exchanged between them.

[0007] A data link layer (Layer 2) is responsible for moving information across a particular link by packaging raw bits into logically structured packets or frames. Layer 2 ensures good transmission and correct delivery by checking errors, re-transmitting as necessary, and attaching appropriate addresses to the data sent across a physical medium. If a destination computer does not send an acknowledgment of frame receipt, Layer 2 resends the frame. The media access methods (e.g., CSMA/CD and Token Passing) are regarded as Layer 2 activities. Layer 2 may be further divided into two sub-layers: Logical Link Control (LLC) and Media Access Control (MAC). The MAC sublayer defines procedures the stations must follow to share the link and controls access to the transmission link in an orderly manner. The MAC sublayer defines a hardware or data link address called a MAC address. The MAC address is unique for each station so that multiple stations can share the same medium and still uniquely identify each other. The LLC sublayer manages communications between devices over a single link of the communications network.

[0008] A network layer (Layer 3) is set up to route data from one network user to another. Layer 3 is responsible for establishing, maintaining, and terminating the network connection between two users and for transferring data along that connection. Layer 3 addresses messages and determines the route along the network from the source to the destination computer. Layer 3 manages traffic, such as switching, routing, and controlling the congestion of data transmissions.

[0009] A transport layer (Layer 4) is responsible for providing data transfer between two users at an agreed level of quality. When a connection is established, this layer is responsible for selecting a particular quality of service (QoS), for monitoring transmissions to ensure the selected QoS, and for notifying the users if the QoS deteriorates. Layer 4 also provides for error recognition and recovery, repackaging of long messages into smaller frames of information, and acknowledgments of receipt.

[0010] A session layer (Layer 5) focuses on providing services used to organize communication and synchronize the dialog that takes place between users and to manage the data exchange. The primary concern of Layer 5 is controlling when users can send and receive concurrently or alternately. A presentation layer (Layer 6) is responsible for the presentation of information in a way that is meaningful to network users. This may include character code transmission, data conversion, or data compression and expansion.

[0011] Layer 6 translates data from both Layer 5 and Layer 7 into an intermediate format and provides data encryption and compression services. Layer 7 is an application layer that provides means for application processes to access the system interconnection facilities in order to exchange information. This includes services used to establish and terminate the connections between users and to monitor and manage the systems being interconnected, as well as the various resources they employ.

[0012] As data is passed through the layers, each layer may or may not add protocol information to the data, for example, by encapsulating frames with a header on the way down the protocol stack or removing the header on the way up. The individual protocols define the format of the headers.
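The header handling described in the preceding paragraph can be illustrated with a minimal Python sketch. The layer names and the bracketed text headers are hypothetical placeholders chosen for readability; real protocols define their own binary header formats.

```python
# Illustrative only: each layer prepends a text header on the way down
# the stack and strips it on the way up. Header strings are hypothetical.
LAYERS = ["transport", "network", "datalink"]  # order going down the stack


def encapsulate(payload: bytes) -> bytes:
    """Wrap the payload with one header per layer, upper layers first."""
    data = payload
    for layer in LAYERS:
        data = f"[{layer}]".encode() + data
    return data


def decapsulate(frame: bytes) -> bytes:
    """Strip headers in the reverse of the order in which they were added."""
    data = frame
    for layer in reversed(LAYERS):
        header = f"[{layer}]".encode()
        if not data.startswith(header):
            raise ValueError(f"missing {layer} header")
        data = data[len(header):]
    return data


frame = encapsulate(b"hello")
```

The outermost header belongs to the lowest layer, so the receiving side removes headers from the outside in as the data moves back up its stack.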

[0013] MAC addressing includes a source address and a destination address, each of which has a predefined relationship to a network station. Higher network layers provide a network address that has a logical relationship established by a network administrator according to a predetermined network addressing arrangement. The assigned network address conveys information that can be used by a router when routing frames through the internetwork. If the network address is hierarchical, a router may use a portion of the address to route the packet to a higher-level partition or domain in the internetwork. Some protocols are hierarchical and others are not, so hierarchical routing may or may not be available.

[0014] The global network may be subdivided into IP networks, which in turn may be subdivided into subnets. An IP address includes a network number (assigned by IANA), a subnet number (assigned by a network administrator), and a host number that identifies an end station. The host number may be assigned by a network administrator, or may be assigned dynamically. This is a form of hierarchical addressing that is used by IP routing algorithms to perform hierarchical or prefix routing operations. Routing algorithms maintain information of all higher-level routing environments in routing tables for domains by recording their shortest unique address prefixes.
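As a rough illustration of the hierarchical addressing in the preceding paragraph, the following Python sketch uses the standard `ipaddress` module. The addresses and prefix lengths are made-up examples, and the `longest_prefix_match` helper is a hypothetical name; it shows the common longest-prefix selection rule rather than any particular routing algorithm from this description.

```python
import ipaddress

# Hypothetical example values; the /16 and /24 prefix lengths are assumptions.
network = ipaddress.ip_network("172.16.0.0/16")          # network number
subnets = list(network.subnets(new_prefix=24))           # administrator-defined subnets
host = ipaddress.ip_address("172.16.5.42")               # one end station

# Find the subnet that contains the host, then isolate the host number.
subnet = next(s for s in subnets if host in s)
host_number = int(host) & int(subnet.hostmask)


def longest_prefix_match(address, routes):
    """Return the most specific route containing the address, or None."""
    matches = [route for route in routes if address in route]
    return max(matches, key=lambda route: route.prefixlen, default=None)


routes = [ipaddress.ip_network("0.0.0.0/0"), network, subnet]
best = longest_prefix_match(host, routes)
```

A router holding only the `/16` entry can still forward traffic toward the right domain, which is the point of using a portion of the address for higher-level routing.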

[0015] A station may support more than one network layer protocol. Such a station has multiple network addresses and multiple protocol stacks that present the same MAC address on a port for the different protocols. Thus, a multi-protocol stack station connected to both an IP and an IPX network includes an IP network address and an IPX network address.

[0016] A communications network may include a number of network entities (or nodes), a number of interconnecting links and communication devices. A network node is, for example, a personal computer, a network printer, file server or the like. An interconnecting link is, for example, an Ethernet, Token-Ring or other type network link. Communication devices include routers, switches, bridges or their equivalents. As computer networks have grown in size, network management systems that facilitate the management of network entities, communication links and communication devices have become necessary tools for a network administrator.

[0017] A bridge or a switch is a Layer 2 entity that is typically a computer with a plurality of ports for establishing connections to other entities. The bridging function includes receiving data from a port and transferring that data to other ports for receipt by other entities. A bridge moves data frames from one port to another using the end-station MAC address information contained in the switched frames. Switches interconnect the communication media to form small domains of stations, such as a subnetwork. Subnetworks or subnets provide an organizational overlay to an internetwork that facilitates transmission of data between the end stations, particularly for broadcast transmissions. The subnet functions to limit the proliferation of broadcast frames to stations within a broadcast domain.
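One way to picture how a bridge uses end-station MAC addresses is the toy Python model below. It assumes a learning bridge that records the port on which each source address was last seen; that learning behavior, the `Bridge` class, and the shortened MAC strings are illustrative assumptions, not details taken from this description.

```python
class Bridge:
    """Toy Layer 2 bridge; the address-learning behavior is an assumption."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}  # MAC address -> port where it was last seen

    def receive(self, in_port, src_mac, dst_mac):
        """Return the set of ports the frame is sent out of."""
        self.mac_table[src_mac] = in_port        # learn the sender's port
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:                     # unknown destination: flood
            return self.ports - {in_port}
        if out_port == in_port:                  # same segment: do not forward
            return set()
        return {out_port}


bridge = Bridge(ports=[1, 2, 3])
first = bridge.receive(1, "aa:aa", "bb:bb")   # bb:bb unknown, so flooded
second = bridge.receive(2, "bb:bb", "aa:aa")  # aa:aa was learned on port 1
```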

[0018] A router is an intermediate station that interconnects domains or subnets by providing a path from a node on a first network to a node on a second network. There are single protocol or multi-protocol routers, central or peripheral routers, and LAN or WAN routers. A peripheral router connects a network to a larger internetwork, and thus may be limited to a single protocol. A central router may be connected to a different board in a server or a hub and thus usually has a multi-protocol capability.

[0019] A router provides the path by first determining a route and then providing an initial connection for the path. A router executes network routing software that depends on the protocol used. A router can work with different data-link layer protocols and thus can connect networks using different architectures, for example, Ethernet to Token Ring to FDDI. Furthermore, there are routers of several levels, wherein, for example, a subnetwork router can communicate with a network router. Organizing a communications network into levels simplifies the routing tasks since a router needs to find only the level it must deal with. The use of different network levels is shown in FIG. 1.

[0020] In general, a global communications network connects devices separated by hundreds of kilometers. A LAN covers a limited area of at most several kilometers in radius, connecting devices in the same building or in a group of buildings. LANs usually include bridges or switches connecting several end-stations and a server. In a LAN, a bridge or a switch broadcasts traffic to all stations. Until a few years ago, a LAN was user-owned (did not run over leased lines) with gateways to public or other private networks. When a user moved or changed to an end-station at another location on the network, a network administrator had to rewire and reconfigure the user's station. This has changed with the introduction of virtual LANs.

[0021] A virtual LAN (VLAN) is a logical Layer 2 broadcast domain, which enables a logical segmentation of the network without changing the physical connections. A VLAN enabled switch segments the connected stations into logically defined groups. Broadcast traffic from a server or an end-station in a particular VLAN is replicated only on those ports connected to end-stations belonging to that VLAN. The broadcast traffic is blocked from ports with no end-points belonging to that VLAN, creating a similar type of broadcast containment that routers provide. VLANs may also be defined between different domains connected by a router. In this case, the router passes network traffic from one domain to the other (as done without defining a VLAN), and passes network traffic from one VLAN to the other. The router also passes network traffic between VLANs that are in the same domain because VLANs do not normally share user information. The router is configured as a member of all VLANs.
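The broadcast containment described in the preceding paragraph can be sketched in a few lines of Python. The port-to-VLAN mapping and the `broadcast_ports` helper are hypothetical, and the sketch assumes each port belongs to exactly one VLAN.

```python
def broadcast_ports(port_vlans, in_port):
    """Ports that receive a broadcast frame arriving on in_port.

    Only ports in the same VLAN as the ingress port are included;
    all other ports are blocked.
    """
    vlan = port_vlans[in_port]
    return {port for port, v in port_vlans.items() if v == vlan and port != in_port}


# Hypothetical switch configuration: port number -> VLAN identifier.
port_vlans = {1: 10, 2: 10, 3: 20, 4: 10, 5: 20}
```

With this mapping, a broadcast entering on port 1 reaches only ports 2 and 4, and a broadcast entering on port 3 reaches only port 5.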

[0022] Virtual Private Networks (VPNs) have been designed to interconnect end-stations that are geographically dispersed. For example, owners of large communications networks can provide centralized management services to small and medium sized businesses. The provider can configure VPNs that interconnect various customer sites in geographically separate locations. These VPNs offer privacy and cost efficiency through sharing of network infrastructure. Various VPNs have been proposed with various degrees of security, privacy, scalability, ease of deployment and manageability.

[0023] A global communications network may use different routing and connection management protocols at different levels. For example, the International Organization for Standardization (ISO) Open Systems Interconnection (OSI) Intermediate System to Intermediate System (IS-IS) protocol and the Internet Open Shortest Path First (OSPF) protocol are used for connectionless routing of data frames. The Asynchronous Transfer Mode (ATM) Forum Private Network-Network-Interface (PNNI) protocol is used for connection oriented multi-media services. The routing protocols identify a network node using a global address of a Route Server Element (RSE). The RSEs generate routing information that identifies optimal routes for communication throughout the network. The RSE is responsible for administration of the algorithms that enable a node to keep its view of the network topology and performance metric current, referred to as Routing Information Exchange (RIE). Thus an RSE usually acts as a central element for the routing of traffic through the node.

[0024] In general, the use of WANs, LANs, VPNs, and VLANs has increased the number and complexity of communications networks. These networks continuously evolve and change due to growth and introduction of new interconnections, topologies, protocols, or applications. Furthermore, most networks have redundant communication paths to prevent portions of the network from being isolated due to link failures. Also, multiple paths can be used simultaneously to load-balance data between the paths. However, redundant paths can also introduce problems such as formation of loops. Furthermore, network performance can degrade due to improper network configurations, inefficient or incorrect routing, redundant network traffic or other problems. Network hardware and software systems may also contain design flaws that affect network performance or limit access by users to certain of the resources on the network. These factors make network management complex and difficult.

[0025] A network management process controls and optimizes the efficiency and productivity of a communications network. A network management station manages the network entities (e.g., routers, bridges, switches, servers, storage devices, computers, printers) using a network management protocol such as the Simple Network Management Protocol (SNMP), the Internet Control Message Protocol (ICMP), or another network management protocol known in the art. Using a network management protocol, the network management station can deliver information or receive information by actively polling the network entities or by receiving unsolicited information from the network entities. Using SNMP, a network management station can execute set, get, or get-next functions to set and retrieve information from a network entity. This information may be stored within the polled network entity as a Management Information Base (MIB). The network management station can receive unsolicited information from a network entity in the form of an SNMP trap. Network entities may send SNMP traps to the network management station when a problem in the network or network entity occurs.
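To make the set, get, and get-next functions concrete, the following Python sketch models an agent's MIB as an in-memory dictionary keyed by object identifier (OID) tuples. It does not use any SNMP library or send network traffic; the `ToyAgent` class and `walk` helper are hypothetical, and the stored values are invented. Only the two OIDs (sysUpTime.0 and sysName.0 from the standard system group) are real.

```python
import bisect

SYS_UPTIME = (1, 3, 6, 1, 2, 1, 1, 3, 0)  # sysUpTime.0
SYS_NAME = (1, 3, 6, 1, 2, 1, 1, 5, 0)    # sysName.0


class ToyAgent:
    """In-memory stand-in for an agent holding MIB values."""

    def __init__(self, mib):
        self.mib = dict(mib)

    def get(self, oid):
        """Return the value stored at exactly this OID."""
        return self.mib[oid]

    def get_next(self, oid):
        """Return (oid, value) for the first OID after the given one, or None."""
        oids = sorted(self.mib)  # tuples sort in lexicographic OID order
        index = bisect.bisect_right(oids, oid)
        if index == len(oids):
            return None
        return oids[index], self.mib[oids[index]]

    def set(self, oid, value):
        """Store a new value at the OID."""
        self.mib[oid] = value


def walk(agent, start):
    """Repeat get-next from start until the end of the toy MIB is reached."""
    results, step = [], agent.get_next(start)
    while step is not None:
        results.append(step)
        step = agent.get_next(step[0])
    return results


agent = ToyAgent({SYS_UPTIME: 12345, SYS_NAME: "router-1"})
agent.set(SYS_NAME, "core-router-1")
```

Because get-next needs no prior knowledge of which objects exist, a management station can discover an entity's contents by repeating it, as `walk` does here.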

[0026] A network management station may be implemented using any general purpose computer system, which is programmable using a high-level computer programming language or using specially programmed, special purpose hardware. The hardware includes a processor executing an operating system providing a platform for computer programs that run scheduling, debugging, input-output control, accounting, compilation, storage assignment, data management, memory management, communication control and other services. The application programs are written in high level programming languages.

[0027] A network management station can include a network manager unit, a network communication interface, a data acquisition unit, a data correlation unit, and a graphical user interface. The data correlation unit interprets data received through the data acquisition unit and presents the interpreted data to a user on the graphical user interface. The network communication interface may include transport protocols and LAN drivers used to communicate information to the communications network. The transport protocols may be IPX, TCP/IP or other well-known transport protocols. The LAN drivers may include software required to transmit data on a communications network through the network interface. The LAN drivers are generally provided by the manufacturer of the network interface for a general purpose computer for the purpose of communicating through the network interface. The network manager unit may be an SNMP network manager/agent implementing SNMP functions, or another type of network manager unit performing associated management functions. The network manager unit utilizes the network communication interface to transfer requests to network entities over a communications network.

[0028] A network management station may use a network management agent residing on a network entity. The network management agent may be a software process running on a processor or may be special purpose hardware. The network management agent may be an SNMP agent (or ICMP agent), which may include a data collection unit, a network manager unit, and a network communication interface for communication as described above. For example, this communication may use network management functions such as SNMP functions. Alternatively, a network management agent, residing on a network entity, may include a data correlation unit, a data collection unit, a network manager unit and a network communication interface for communication.

[0029] There are prior art network management systems (NMS) that detect a fault and represent the fault status in the form of a single Boolean attribute of the model representing a faulty network element in an NMS database. Here, the fault status represents the NMS's ability to contact a network element using common management protocols such as the SNMP protocol or the ICMP protocol.

[0030] There are also prior art NMS that include objects, called inference handlers. Inference handlers perform work based on changes to a managed entity's attribute. In an NMS, the inference handler provides the intelligence behind the objects. An inference handler can perform different functions such as fault isolation or suppression, but these are frequently based on the NMS's ability to contact the network element, which is used as the fault status attribute. The NMS can then suppress the fault status of a network element depending on the status of other neighboring network elements. Frequently, however, loss of contact information in an NMS database does not sufficiently represent various problems a network element can experience as a result of a fault in a communications network.

[0031] In general, there is a need for a fault management and diagnosis process that can provide a generic, open framework applicable to any system.

SUMMARY OF THE INVENTION

[0032] The present invention is a system, a method and a product (that can be stored in a computer-readable storage medium) for diagnosing or analyzing faults of various types (including a complete or partial failure).

[0033] According to one aspect, a method or system for analyzing a fault includes a fault object factory constructed and arranged to receive fault data and create a fault object; and a fault diagnosis engine constructed and arranged to perform root cause analysis of the fault object.

[0034] Preferably, the method or system may further include one or more of the following: a fault detector constructed and arranged to detect a fault in a monitored entity; a fault repository constructed and arranged to store and access the fault object; and a fault handler constructed and arranged to be triggered by the fault diagnosis engine to analyze the fault object. The fault handler includes a fault handler tester and a fault handler diagnoser.

[0035] According to another aspect, a method or system for analyzing a fault includes means for receiving fault data, means for creating a fault object, and means for performing a root cause analysis on the object to determine a root cause.

[0036] Preferably, the method or system may further include one or more of the following: Means for creating a fault object includes a fault object factory using fault data or a detector remotely located from the system. Means for performing the root cause analysis includes means for invoking specific fault handlers. Means for employing fault handlers includes employing a diagnoser fault handler or a tester fault handler. Means for employing fault handlers includes obtaining an ordered list of fault handlers for a specified transition state of the fault object. Means for obtaining the ordered list includes employing a diagnoser fault handler registered for the type of the analyzed object. The diagnoser fault handler transitions the fault object between processing states.

[0037] The present system and method provide a generic, open framework that implements a fault diagnosis engine for controlling the entire process, a fault object factory for creating fault objects, a fault repository for receiving and storing fault objects, and fault handlers used for performing fault correlation and root cause analysis.

[0038] The fault management and diagnosis system may be used for diagnosing faults in any system or device (for example, a mechanical or electronic device, a communications network, a material transfer network, a shipping network). The fault diagnosis engine receives detected fault information from multiple sources, controls the fault management, and executes a root cause analysis. The fault diagnosis engine also provides a mechanism for fault correlation and fault impact assessment. In communications networks, the impact assessment is applicable to both disruptions in services (or applications that depend on the network infrastructure) and to reduction of network performance due to the fault.

[0039] As mentioned above, the fault management and diagnosis system uses a fault object factory that creates fault records, called fault objects, that store some or all information pertaining to a single network problem. Each fault has a processing state, which guides the fault through its life cycle. The fault management and diagnosis system uses fault handlers that are specifically designed to be triggered upon changes in the state of a given type of fault. The fault handlers perform various aspects of the automated fault management process described below.

[0040] Advantageously, the present system creates a fault hierarchy or tree as a result of diagnosis of a single detected problem in a managed system and this facilitates root cause isolation. The fault tree facilitates a log of the entire diagnosis process for the analyzed fault, and inferred impact calculation based on the association of faults in the tree. The fault tree also facilitates fault resolution and re-evaluation because the conditions tested during the original diagnosis of a problem are recorded in the tree, and the ability to control the processing of faults based on fault state transition.
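A fault tree of this kind can be pictured with the short Python sketch below. It assumes that each node records whether its tested condition held and that child nodes are candidate causes of their parent; a root cause is then a confirmed fault with no confirmed cause beneath it. The `FaultNode` class, the `root_causes` function, and the example fault names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FaultNode:
    name: str
    confirmed: bool = False                      # did the tested condition hold?
    causes: list = field(default_factory=list)   # child faults: candidate causes


def root_causes(node):
    """Names of confirmed faults that have no confirmed cause below them."""
    if not node.confirmed:
        return []
    found = [name for child in node.causes for name in root_causes(child)]
    return found or [node.name]


# Hypothetical diagnosis of one detected problem.
tree = FaultNode("contact lost: server", True, [
    FaultNode("contact lost: switch", True, [
        FaultNode("link down: port 3", True),
        FaultNode("switch power failure", False),
    ]),
    FaultNode("server NIC failure", False),
])
```

The unconfirmed nodes stay in the tree, so it doubles as a record of which conditions were tested and ruled out.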

BRIEF DESCRIPTION OF THE DRAWINGS

[0041] FIG. 1 shows diagrammatically several network management modules connectable to a communications network.

[0042] FIGS. 2 and 2A are block diagrams of a fault management and diagnosis process.

[0043] FIG. 3 is a block diagram of modules employed in a fault management and diagnosis system.

[0044] FIGS. 3A and 3C are block diagrams of objects employed in the fault management and diagnosis system of FIG. 3.

[0045] FIG. 3B is a block diagram of a fault repository module employed in the fault management and diagnosis system of FIG. 3.

[0046] FIG. 4 is a flow diagram that illustrates a triggering mechanism for fault handlers by a fault diagnosis engine shown in FIG. 3.

[0047] FIGS. 5 and 5A are block diagrams depicting processing states of a fault during fault analysis.

[0048] FIGS. 6, 6A, 6B, and 6C are block diagrams of a fault tree according to one preferred embodiment.

[0049] FIGS. 7, 7A and 7B are block diagrams of a fault tree according to another preferred embodiment.

[0050] FIGS. 8, 8A, 8B, 8C and 8D are block diagrams of a fault tree according to yet another preferred embodiment.

[0051] FIG. 9 illustrates a sample network analyzed by the fault diagnosis and management system.

[0052] FIGS. 10, 10A and 10B are block diagrams of a fault tree created by the fault diagnosis and management system analyzing the sample network of FIG. 9.

DESCRIPTION OF PREFERRED EMBODIMENTS

[0053] FIG. 1 shows diagrammatically a network management system 10 including a fault diagnosis system 12, a topology mapper 14, an impact analyzer 16 and a help desk system 18. The network management system communicates with a communications network 20 (or application service). The network includes a set of interconnected network elements such as routers, bridges, switches, and repeaters. These network elements provide transportation of data between end stations. Furthermore, there are computers known as servers that provide services such as e-mail, accounting software, sales tools, etc. Typically, data is transmitted electronically or optically, and network elements can forward data in packets, frames or cells to the intended destination. Servers include network adapters and/or software that interpret the electronic or optical data packet into the data elements and pass these elements to the appropriate application being hosted.

[0054] The network management system 10 includes a commercially available processor (for example, a Pentium microprocessor manufactured by Intel Corporation) executing an operating system providing an operating environment for a network management program. The processor and the operating system provide a computer platform for which application programs are written in higher level programming languages. The computer (or application host) interfaces with permanent data storage, such as a magnetic or optical disk drive, a disk array, non-volatile RAM disk, or a storage area network, which maintains data files such as user configurations and policies. In general, the network management program may be configured as a generic software application residing in any commercially available computing platform.

[0055] Preferably, fault diagnosis system 12, topology mapper 14, and help desk system 18 are software applications written in Java and running on any computer with a Java Runtime Environment (JRE). Examples include a Dell laptop computer with an Intel Pentium processor running the Windows 2000 operating system, or a Sun Ultra 60 computer running Solaris v. 2.7. Alternatively, fault diagnosis system 12, topology mapper 14, and help desk system 18 are developed in any object oriented or structured programming language, and compiled for execution on any one or many computer platforms, or could be implemented on a neural network computing device.

[0056] The computer has a network adaptor that provides communication (preferably, but not necessarily, IP) to the users on the network. The fault diagnosis engine application may share a host with help desk system 18, and/or the topology mapper, or each can run on a separate host, in which case they communicate using a network adaptor. Topology mapper 14 determines the network topology and creates a model. The permanent data storage holds data files that describe the current network topology, and configuration files that control the performance of topology mapper 14. A user is an end station, interfaced to access the network or services, used by a person who is using the network, or is using services provided by the network.

[0057] The network management system 10 performs a fault management process 30 shown in FIG. 2. The entire process is part of a phased, componentized, but interconnected method, wherein all aspects of fault management are performed. The fault management process of FIG. 2 includes the following seven phases: fault detection 32, diagnosis 40, impact analysis 50, prioritization 60, presentation 70, recourse 80, and resolution 90.
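For orientation, the seven phases can be written down as an ordered enumeration. In the Python sketch below, the enum values reuse the reference numerals from FIG. 2 purely as labels; the `run_phases` helper and the dictionary-based fault record are hypothetical and only show phases being applied in order, with optional phases skipped.

```python
from enum import Enum


class Phase(Enum):
    """Phases of the fault management process; values echo the reference numerals."""
    DETECTION = 32
    DIAGNOSIS = 40
    IMPACT_ANALYSIS = 50
    PRIORITIZATION = 60
    PRESENTATION = 70
    RECOURSE = 80
    RESOLUTION = 90


def run_phases(fault, steps):
    """Apply the step registered for each phase, in definition order."""
    for phase in Phase:              # Enum iteration follows definition order
        step = steps.get(phase)
        if step is None:             # phases without a step are skipped
            continue
        step(fault)
        fault["history"].append(phase.name)
    return fault


# Hypothetical steps, deliberately registered out of order.
steps = {
    Phase.DIAGNOSIS: lambda f: f.update(root_cause="link down"),
    Phase.DETECTION: lambda f: f.update(detected=True),
}
result = run_phases({"entity": "router-1", "history": []}, steps)
```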

[0058] Fault detection process 32 (performed by fault detectors 130 shown in FIG. 3) is the most basic part of the fault management system. Fault detectors 130 detect raw fault data. Fault detectors 130 receive information by SNMP polling, SNMP trap handling, performance monitoring, historical trend analysis, device configuration monitoring, application and system-level management tools, and help desk trouble tickets. Fault detection process 32 can also add information to the raw fault data enabling improved diagnosis of the fault. The fault data are assembled into fault objects.

[0059] Fault diagnosis 40 occurs after a “detected” fault is entered into a fault detection and management system 100, which is a generic system for diagnosing a fault in any mechanical, electrical or other system. Fault detection and management system 100 (FIG. 3) processes and correlates detected faults with other faults to determine their relationship. Fault detection system 100 finds one or more “root cause” faults and isolates these faults. Furthermore, the system can optionally suppress other symptomatic faults that were “caused” by the root cause fault. Fault diagnosis 40 can be performed in a single step or can involve many techniques such as examining device neighbor knowledge, tracing the route of management data, examining route tables and ACLs, etc.

[0060] Fault impact analysis 50 determines the “scope” of the analyzed fault. After receiving a root cause fault determined by fault diagnosis 40, impact analysis 50 determines the consequences of this fault. This determination includes analyzing the network services affected by the fault, the users affected by the fault, and any other ramifications the fault has on network 20, or the application being managed. Furthermore, impact analysis 50 may involve analyzing various logical layers that exist in a communication network and correlating a fault with its possible consequences at each layer. Impact analysis 50 may use a fault causality tree located in a fault repository 140 (FIG. 3). The interpretation schemes include analyzing how a network fault affects services like web servers or e-mail, examining how a misconfigured router running OSPF affects the users in each area, etc.
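One simple way to compute the scope of a fault is to follow dependency relationships outward from the failed element. The Python sketch below does this with a breadth-first traversal; the dependency map, the element names, and the `impacted` helper are hypothetical, and the sketch stands in for, rather than reproduces, the fault causality tree mentioned above.

```python
from collections import deque


def impacted(dependents, failed):
    """Everything that directly or transitively depends on the failed element."""
    seen, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for item in dependents.get(current, ()):
            if item not in seen:
                seen.add(item)
                queue.append(item)
    return seen


# Hypothetical map: element -> things that depend on it.
dependents = {
    "router-1": ["mail-server", "web-server"],
    "mail-server": ["e-mail service"],
    "web-server": ["intranet site"],
    "e-mail service": ["user:alice", "user:bob"],
    "intranet site": ["user:bob"],
}
```

Filtering the result by kind (servers, services, users) then gives the affected services and affected users separately.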

[0061] The network management system may also perform fault prioritization 60. After a fault has been diagnosed and its impact analyzed, the fault may be prioritized. Fault prioritization 60 assigns a priority/severity to each fault object, and this is used to determine the proper presentation of the fault to a user. Fault prioritization process 60 may include multiple methods based on the type and scope of the fault, such as examination of service level agreements and how the fault violates them, mission critical device analysis, and fault scope.

[0062] The network management system may also perform fault presentation 70. Fault presentation 70 provides the mechanism by which the system alerts a user that a fault has occurred. Fault presentation process 70 presents all information about the fault in a user friendly manner. Fault presentation 70 may include the steps and processes the system used to diagnose the fault, thus allowing a user to verify the diagnosis and “trust” the system to accurately diagnose faults. Fault presentation 70 may also include a network monitoring alarm system.

[0063] The network management system may also include fault recourse 80. Fault recourse 80 provides a way in which a user can change the network management based on a given fault. For example, fault recourse 80 may involve reducing or stopping polling of devices downstream from a fault, reconfiguring connectivity modeling, invoking a script to fix a misconfigured static route, or configuring user groups for a different email server.

[0064] The network management system may also include fault resolution 90. After a fault has been presented to a user and the problem fixed, fault resolution 90 records the process for future fault detection and diagnosis. Fault resolution 90 can automatically trigger, for any single resolved fault, a re-evaluation of associated faults in the system. This re-evaluation proactively assesses the full scope of a resolved fault. If an associated fault is still not resolved, diagnosis can be re-started to determine the cause. This process is facilitated by the use of the fault causality tree created as a result of fault diagnosis process 40.

[0065] FIG. 2A shows diagrammatically in detail fault diagnosis process 40. A detected fault enters the fault detection and management system and a fault object is created (step 42). The fault diagnosis engine (101 in FIG. 3) triggers appropriate fault handlers (step 43). A diagnoser fault handler generates possible faults that may be causes of the previously entered fault (step 44). For each generated possible fault, fault diagnosis engine 101 triggers appropriate tester fault handlers (step 45). Each tester fault handler performs vendor-specific and domain-specific tests to determine the existence of one or several possible faults. Next, the tester fault handler records test results (step 46). If additional possible faults exist, the fault diagnosis engine continues to trigger tester fault handlers and diagnoser fault handlers (step 49). If there are no other possible faults, the fault diagnosis engine has isolated the fault and the system proceeds to impact analysis 50.

[0066] FIG. 3 illustrates diagrammatically a fault detection and management system 100. One embodiment of fault detection and management system 100 is fault diagnosis system 12 (FIG. 1). Fault detection and management system 100 includes five main parts: a fault diagnosis engine 101, a fault object factory 110, fault detectors 130, a fault repository 140, and fault handlers 150. Fault detection and management system 100 has the ability to receive detected fault information from multiple sources, control the management of the faults, and produce a root cause analysis. Furthermore, the system also provides a mechanism for performing fault correlation and impact analysis. The impact assessment is not limited to the impact on the communications network, but may include disruptions in services or applications that depend on the network infrastructure.

[0067] Fault object factory 110 receives data from fault detectors 130 and creates fault objects 112 shown in FIG. 3A. Each fault object 112 is associated with a fault type and there may be many fault types. Furthermore, each instance is a separate occurrence of a problem, potential problem, or condition of a communication network or an element located in the communication network (such as a misconfiguration, a degradation of service, physical failure or other).

[0068] Referring to FIG. 3A, the entire architecture of the fault detection and management system is based on fault objects 112, which are records representing a detected problem, a potential problem, or a condition. Fault object 112 includes information about a detected fault, that is, it includes a description of the problem or condition stored in field 114, time and date of the reported problem 116, a fault processing state 118, and one or more test result objects 120. The fault structure includes a context that is a mechanism for sharing varying amounts of data related to the fault; the amount may vary between instantiations of a type of fault.

[0069] Referring to FIG. 3, fault detector 130 detects a problem or potential problem on an entity in a managed system. Fault detector 130 provides a record of the condition to fault object factory 110, which generates fault object 112. Fault detector 130 can monitor an entity or receive unsolicited notification from an entity when a problem occurs, according to different methods known in the art. Fault detector 130 may perform a test and may provide to fault object factory 110 data with the results of the performed tests. Fault detector 130 may share a host with fault diagnosis engine 101, or may reside externally as an agent.

[0070] Referring to FIG. 3B, fault repository 140 is the component used by fault detection and management system 100 to store and access fault information. Fault repository 140 stores every fault object 112 present in the system. Each component of the system (detection, diagnosis, etc.) can enter new fault objects into fault repository 140 and access any fault object 112. Preferably, fault repository 140 includes a table structure with services capable of searching and locating existing faults.

[0071] Fault repository 140 also includes fault associations 142, which provide a mechanism for relating faults to one another. Specifically, each defined fault association relates two fault objects. One fault object is on the left side of the association, and the other fault object is on the right side, as shown for fault trees below. The semantics of an association are defined by the type of the association. New fault association types can be defined and added to the system, preferably using Interface Description Language (IDL) definitions of an interface for a service that uses the Common Object Request Broker Architecture (CORBA) transport protocol.
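
By way of illustration only, a typed, two-sided fault association of the kind described above might be sketched in Python as follows. The names FaultAssociation, AssociationStore, add, remove and right_of are hypothetical and do not come from this description, which prefers IDL definitions over a CORBA transport; the sketch also assumes that faults are referred to by simple string identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultAssociation:
    """Relates exactly two faults; the type carries the semantics."""
    assoc_type: str  # e.g. "MaybeCausedBy" or "CausedBy"
    left: str        # identifier of the fault on the left side
    right: str       # identifier of the fault on the right side


class AssociationStore:
    """Hypothetical in-memory stand-in for fault associations 142."""

    def __init__(self):
        self._items = set()

    def add(self, assoc_type, left, right):
        self._items.add(FaultAssociation(assoc_type, left, right))

    def remove(self, assoc_type, left, right):
        self._items.discard(FaultAssociation(assoc_type, left, right))

    def right_of(self, assoc_type, left):
        """Return, sorted, the faults related to `left` by the given type."""
        return sorted(a.right for a in self._items
                      if a.assoc_type == assoc_type and a.left == left)


store = AssociationStore()
store.add("MaybeCausedBy", "A", "B")
store.add("MaybeCausedBy", "A", "C")
store.add("CausedBy", "A", "C")
```

Because the association type is carried as data in this sketch, a new type can be introduced without changing the store itself.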

[0072] Referring again to FIG. 3, each fault handler 150 performs a designated type of work as a result of a fault object entering a certain processing state (shown in FIG. 5). Fault handlers 150 may exist internal to the system, or reside externally in a separate process. Fault handlers 150 are registered for a particular fault type and state and, as part of the registration process, each fault handler 150 has an integer priority value. Then, fault handlers 150 are sorted by their priority values so that a fault handler with the lowest priority value is triggered first and subsequent handlers are triggered in sequence, as described below. One type of fault handler 150 can test a fault object and create a test result record. Furthermore, fault handler 150 may create additional types of fault objects, create associations between fault objects, correlate fault objects that indicate a similar problem, or perform impact analysis on a fault object to determine the scope of a problem. A tester fault handler 152 performs a selected test on a fault. A diagnoser fault handler 154 creates additional types of fault objects.

[0073] Fault diagnosis engine 101 is the central component of fault detection and management system 100 since it drives the management and diagnosis of faults. Fault diagnosis engine 101 provides a generic mechanism for fault handlers 150 to register for changes in the processing state of faults of a given fault type. Fault diagnosis engine 101 may employ any mechanism to specify registrations. The preferred implementation of fault diagnosis engine 101 uses XML (Extensible Markup Language) technology.

[0074] Referring to FIG. 4, when a fault transitions to a state for which a handler has registered, the engine triggers the handler to perform its work. Fault diagnosis engine 101 can trigger one of fault handlers 150 arbitrarily or may use some ordering mechanism. Preferably, fault diagnosis engine 101 uses a priority mechanism to order the triggering of fault handlers that are sorted by their priority value (by triggering first a fault handler with the lowest value).
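
The registration and priority ordering just described may be sketched, for illustration only, as a registry keyed by fault type and processing state. The class HandlerRegistry, its methods and the example handler functions are assumptions made for this sketch; the description itself prefers XML for specifying registrations and does not define such an interface.

```python
class HandlerRegistry:
    """Hypothetical registry: (fault type, state) -> prioritized handlers."""

    def __init__(self):
        self._entries = {}

    def register(self, fault_type, state, priority, handler):
        """Record a handler together with its integer priority value."""
        key = (fault_type, state)
        self._entries.setdefault(key, []).append((priority, handler))

    def handlers_for(self, fault_type, state):
        """Return handlers ordered so the lowest priority value comes first.

        sorted() is stable, so handlers with equal priority keep their
        registration order (an assumption; the text does not cover ties).
        """
        entries = self._entries.get((fault_type, state), [])
        return [h for _, h in sorted(entries, key=lambda e: e[0])]


def correlate(fault):
    return "correlated"


def diagnose(fault):
    return "diagnosed"


registry = HandlerRegistry()
registry.register("HTTPLostFault", "detected", 20, correlate)
registry.register("HTTPLostFault", "detected", 10, diagnose)
order = [h.__name__
         for h in registry.handlers_for("HTTPLostFault", "detected")]
```

In this sketch the handler registered with value 10 is returned ahead of the one registered with value 20, regardless of registration order.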

[0075] Fault detection and management system 100 uses fault processing states for analyzing faults. A fault's processing state represents its status in the fault management process and provides a mechanism to control the management of the fault. A fault can have a large number of possible states, and a fault can transition from state to state in different ways, as shown in FIGS. 5 and 5A. Preferably, the system utilizes a fault type hierarchy in which generic base fault types are defined and from which new, more specific fault types can be derived. Each fault that exists in the system is of some pre-defined fault type.

[0076] Referring to FIG. 3C, a test result object 120 includes a record of the results of tests that were performed to determine the existence of the problem or condition for which the fault was created. Test result object 120 includes a textual description of the test (field 122), data identifying the target of the fault (field 123), test data (field 124), and any thresholds and parameters used in determining the test result (field 125). Test result object 120 also contains a state representing the status of the test.
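
For illustration only, the test result object above, together with the fault object described with reference to FIG. 3A, might be modeled as two simple records. The class names ResultRecord and FaultObject, the attribute names, the use of dictionaries for test data and context, and the example values are all assumptions of this sketch; only the correspondence to fields 114 through 125 follows the description.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List


@dataclass
class ResultRecord:
    """Sketch of test result object 120."""
    description: str            # textual description of the test (field 122)
    target: str                 # target of the fault (field 123)
    data: Dict[str, Any]        # test data (field 124)
    parameters: Dict[str, Any]  # thresholds and parameters (field 125)
    state: str = "UNKNOWN"      # PROBLEM, NO_PROBLEM or UNKNOWN


@dataclass
class FaultObject:
    """Sketch of fault object 112."""
    fault_type: str
    description: str                   # field 114
    reported_at: datetime              # time and date 116
    processing_state: str = "initial"  # fault processing state 118
    test_results: List[ResultRecord] = field(default_factory=list)
    context: Dict[str, Any] = field(default_factory=dict)


# Hypothetical example values, chosen only to exercise the records.
fault = FaultObject("HTTPServerDown", "HTTP server process not running",
                    datetime(2001, 5, 7, 12, 0))
fault.test_results.append(
    ResultRecord("query agent for server process", "10.253.102.15",
                 data={"process_running": True}, parameters={},
                 state="NO_PROBLEM"))
```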

[0077] While performing its work on a fault object, a fault handler may cause the processing state of the fault to be changed. In this case, no other handlers for the current state are triggered. Fault diagnosis engine 101 obtains the handlers for the new state and resumes triggering with the new handlers when the current handler completes its work.
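
The behavior described above, in which a state change cuts short the handlers of the old state, can be sketched as a simple loop. This is an illustrative sketch only: the function run_handlers, the dictionary representation of a fault and the handler names are hypothetical, and the sketch assumes that handlers do not move a fault back and forth between states indefinitely.

```python
def run_handlers(fault, handlers_by_state):
    """Trigger handlers for the fault's state; start over on a state change.

    Returns the names of the handlers that were triggered, in order.
    """
    triggered = []
    while True:
        state = fault["state"]
        changed = False
        for handler in handlers_by_state.get(state, []):
            triggered.append(handler.__name__)
            handler(fault)  # the current handler completes its work first
            if fault["state"] != state:
                changed = True
                break       # remaining handlers for the old state are skipped
        if not changed:
            return triggered


def diagnose(fault):
    fault["state"] = "completed"  # this handler changes the processing state


def late_handler(fault):
    pass  # registered for "detected" after diagnose, so it is never reached


def notify(fault):
    pass  # registered for the new state


handlers_by_state = {"detected": [diagnose, late_handler],
                     "completed": [notify]}
fault = {"state": "detected"}
log = run_handlers(fault, handlers_by_state)
```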

[0078] FIG. 4 illustrates the triggering mechanism using a flow diagram. Fault diagnosis engine 101 provides a triggering mechanism and controls and manages the diagnosis process.

[0079] Referring to FIG. 5, fault diagnosis engine 101 utilizes processing states of a fault to control the flow of diagnosis for that fault. As described above, fault handlers 150 are triggered for a fault based on the current processing state. The transition diagram of FIG. 5 defines the following processing states: An initial state 180 begins the life-cycle of a fault object. A detected state 182 indicates that an external fault detector 130 or an internal handler 150 positively determined the condition (that the fault represents) as a problem. A testing state 184 indicates the fault is unverified; that is, the condition that the fault represents requires testing to determine if it is a problem. A completed state 186 indicates that fault diagnosis has completed for the fault.

[0080] Fault diagnosis engine 101 may allow fault handlers 150 to directly transition a fault between states; preferably, however, the processing state is hidden from fault handlers 150. The engine transitions a fault's processing state based on the state of the current result of the fault as provided by the handlers. There are three test result states (shown in FIG. 5A): PROBLEM indicates a test has identified the fault to be a problem; NO_PROBLEM indicates a test has verified that the condition that the fault represents does not exist or no longer exists; and UNKNOWN indicates a test could not be completed for some reason or the condition that the fault represents requires verification.

[0081] FIG. 5A illustrates transition of the processing states (shown in FIG. 5) based on test results of an analyzed fault. For example, fault diagnosis engine 101 triggers tester fault handler 152 (FIG. 3) for testing state 184 and diagnoser fault handler 154 for detected state 182. Furthermore, diagnoser fault handler 154 may also be triggered for testing state 184 if there are no tester fault handlers that can perform a direct test. There may also be fault handlers for completed state 186, which would not perform diagnosis, but would perform other tasks such as correlating faults that share a common root cause (described below) or notifying a presentation system to display the diagnosis results when performing presentation process 70.

[0082] Fault diagnosis engine 101 may employ further rules that govern the triggering of fault handlers when there are multiple handlers (or types of handlers) for a particular processing state. If there are multiple types of handlers, the engine may impose an ordering such that all handlers of one type are triggered before any handlers of another type. Furthermore, if a handler provides a concrete result, as defined by the various result states, the engine may suppress remaining handlers of that type from being triggered and/or may suppress handlers of other types.

[0083] According to the preferred embodiment, since there may be both tester fault handlers 152 and diagnoser fault handlers 154 registered for testing state 184, fault diagnosis engine 101 imposes a rule that all tester fault handlers are triggered before any diagnoser fault handler. This is because a tester fault handler can directly determine the existence or nonexistence of a problem, but a diagnoser fault handler cannot. In addition, if a tester fault handler or diagnoser fault handler provides a concrete result, then fault diagnosis engine 101 suppresses remaining handlers for the current processing state. A concrete result is one whose state is either PROBLEM or NO_PROBLEM. A result state of UNKNOWN is not concrete, that is, a result could not be positively determined, as shown in FIG. 5A.
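
The two rules of the preferred embodiment, testers before diagnosers and suppression after a concrete result, may be sketched as follows. This is an illustrative sketch only; the function trigger_testing_handlers and the three example handlers are hypothetical, and the sketch assumes that every handler simply returns one of the three result states.

```python
CONCRETE = {"PROBLEM", "NO_PROBLEM"}


def trigger_testing_handlers(fault, testers, diagnosers):
    """Run all testers before any diagnoser; stop at a concrete result."""
    for handler in list(testers) + list(diagnosers):
        state = handler(fault)
        if state in CONCRETE:
            return state  # remaining handlers for this state are suppressed
    return "UNKNOWN"      # no handler could positively determine a result


calls = []  # records which hypothetical handlers were actually triggered


def agent_query(fault):
    calls.append("agent_query")
    return "UNKNOWN"


def ping(fault):
    calls.append("ping")
    return "NO_PROBLEM"


def path_diagnoser(fault):
    calls.append("path_diagnoser")
    return "UNKNOWN"


result = trigger_testing_handlers({"type": "FullContactLost"},
                                  testers=[agent_query, ping],
                                  diagnosers=[path_diagnoser])
```

Here the second tester returns a concrete result, so the diagnoser is never triggered.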

[0084] Fault detection and management system 100 utilizes a decomposition approach in the diagnosis of a fault to determine the root cause. Fault detector 130 enters a problem or potential problem into fault object factory 110, which creates a fault object treated as a symptom fault. The symptomatic fault is decomposed into one or more constituent faults that further refine the symptom as shown in FIG. 6. Each constituent fault represents a possible suspect that may be causing the symptom. For each constituent fault, tests may be performed to determine the existence of a problem or the fault may be decomposed into further suspects. The process continues until all faults have been completely decomposed and there are no more suspects.

[0085] The end result of this process is a hierarchy of faults in the form of a tree with the original symptomatic fault at the root as shown in FIG. 6. The fault tree includes a root fault level, one or several intermediate fault levels, and a leaf fault level. Each fault in the tree, except the root, has at least one parent fault from which it was decomposed. Each fault also has zero or more child faults that were spawned from it. A child fault represents a possible cause of its parent. A fault that has children but is not the root is termed an intermediate fault. A fault that has no children, that is, one that could not be further decomposed, is termed a leaf fault. A leaf fault that indicates a problem is a probable cause of the root symptom. There may be more than one root cause.

[0086] Referring to FIG. 6, fault A is the original root symptom. Fault A is decomposed into faults B and C, fault B is decomposed into faults D and E, and fault C is decomposed into faults F and G. Faults B and C are intermediate faults because they have children but are not the root. Faults D, E, F, and G are all leaf faults.

[0087] Fault tree 200 enables fault detection and management system 100 to locate one or several root causes of any fault in the tree by traversing the children of that fault and compiling the leaf fault(s) that indicate a problem. Fault tree 200 as a whole also embeds the entire diagnosis process. By traversing the entire sub-tree of any fault, one can compile a complete log of the steps taken and the results of tests performed to diagnose the fault. Thus, a presentation process 70 can display the root cause(s) of a fault and/or can present a diagnosis log allowing an end user to verify the process.

[0088] Referring to FIG. 3, fault diagnosis engine 101 manages the structure of fault tree 200. Fault handlers 150 provide the contents and semantics of the tree. For each fault in fault tree 200, one or more fault handlers 150 are triggered. Fault handler 150 may perform a specific test on the fault and provide a result of the test to the engine, or it may create one or more child faults to find a possible cause. Each new child fault creates a new branch in the fault tree. A branch may be represented preferably by a fault association called MaybeCausedBy shown in FIG. 6.

[0089] Tester fault handler 152 performs a direct test and diagnoser fault handler 154 spawns possible suspect faults. Other types of handlers may correlate similar faults or perform impact analysis. A fault handler 150 could be both a tester fault handler 152 and a diagnoser fault handler 154, in which case it can perform a test, provide a result and also attempt to find the cause. Preferably, however, a handler is not both a tester fault handler 152 and a diagnoser fault handler 154. Furthermore, if diagnoser fault handler 154 does not provide a direct result for a fault object, a composite result is computed from the results of the fault's children (shown, for example, in FIG. 6). Fault diagnosis engine 101 or diagnoser fault handler 154 may compute the composite result according to rules provided below.

[0090] Referring to FIG. 6A, fault A is entered into the system by detector 130 as a problem. Also referring to FIG. 5A, the fault begins in initial state 180 and is transitioned to detected state 182. Diagnoser fault handler 154 creates two child faults, B and C. A MaybeCausedBy association is created between faults A-B and A-C, as shown in FIG. 6A.

[0091] The current result state for each fault is shown in the upper right corner of each fault box. The result state for faults B and C is indicated by a question mark (?) because a result has not yet been computed. Since faults B and C are unverified, the engine transitions these faults to testing state 184. Diagnoser fault handler 154 creates for fault B new faults D and E, and creates for fault C new faults F and G. Next, fault diagnosis engine 101 triggers tester fault handlers 152 for faults D, E, F, and G and these testers assert results shown in FIG. 6B. Since results have been computed for faults D, E, F, and G, fault diagnosis engine 101 transitions these faults to the completed state 186 (FIG. 5A). Next, fault diagnosis engine 101 computes a composite result for faults B and C according to the following default rules for composite result computation:

[0092] 1. If any child fault result state is PROBLEM, then the parent fault's result state is PROBLEM.

[0093] 2. If all child fault result states are NO_PROBLEM, then the parent fault's result is NO_PROBLEM.

[0094] 3. Otherwise, the parent fault's result is UNKNOWN.
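
The three default rules above translate directly into a short function. The following is a minimal sketch for illustration; the function name composite_result is hypothetical, and returning UNKNOWN for a fault with no children is an assumption of the sketch, since the rules do not address that case.

```python
def composite_result(child_states):
    """Compute a parent's result state from its children's result states."""
    states = list(child_states)
    # Rule 1: any PROBLEM child makes the parent PROBLEM.
    if any(s == "PROBLEM" for s in states):
        return "PROBLEM"
    # Rule 2: all children NO_PROBLEM makes the parent NO_PROBLEM.
    # (An empty list falls through to UNKNOWN; this is an assumption.)
    if states and all(s == "NO_PROBLEM" for s in states):
        return "NO_PROBLEM"
    # Rule 3: otherwise the parent is UNKNOWN.
    return "UNKNOWN"
```

Note that rule 1 takes precedence: a parent with one PROBLEM child and one UNKNOWN child is still PROBLEM.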

[0095] Using the above rules, the composite result for fault B is NO_PROBLEM and the composite result for fault C is PROBLEM. Thus fault F is the cause of fault C. The system indicates this causality with another association called CausedBy as shown in FIG. 6C.

[0096] The fault diagnosis is now complete on faults B and C, so they transition to completed state 186. The composite result for A is PROBLEM since the result state for fault C is PROBLEM, and a CausedBy association is created between A and C as shown in FIG. 6C.

[0097] As described above, the system executed root cause analysis and determined that the root cause for symptomatic fault A is fault F. The diagnosis log for fault A shows that the conditions tested for faults D and E did not indicate a problem and that a result for fault G could not be determined, possibly because of the problem on fault F.
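
The traversal described for fault tree 200, which compiles the leaf faults that indicate a problem, can be sketched for the tree of FIG. 6 using the results reported above (D and E without a problem, F a problem, G undetermined). The dictionaries children and leaf_results and the function root_causes are hypothetical names chosen for this illustration, and the tree is represented by plain string identifiers rather than fault objects.

```python
# Tree of FIG. 6: A is decomposed into B and C, B into D and E, C into F and G.
children = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}

# Leaf results as described in the text for this example.
leaf_results = {"D": "NO_PROBLEM", "E": "NO_PROBLEM",
                "F": "PROBLEM", "G": "UNKNOWN"}


def root_causes(fault):
    """Collect the leaf faults under `fault` whose result is PROBLEM."""
    kids = children.get(fault, [])
    if not kids:  # a leaf fault could not be decomposed further
        return [fault] if leaf_results.get(fault) == "PROBLEM" else []
    found = []
    for kid in kids:
        found.extend(root_causes(kid))
    return found
```

Calling root_causes on any fault in the tree, not only the root, yields the root causes within that fault's sub-tree.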

[0098] If the resulting fault tree did not find a root problem, then the composite result for fault A would indicate NO_PROBLEM. Such a result would contradict the original assertion of a PROBLEM. In this case, the engine would throw out the composite result and leave the original PROBLEM result. Such a problem may have been intermittent and resolved itself quickly, or the detector was faulty, or the diagnosis was incomplete, perhaps requiring additional testing and diagnosis.

[0099] Referring to FIG. 7, according to another situation, multiple symptomatic faults may be caused by the same root problem. This may result in portions of multiple fault trees being identical, that is, the diagnosis for multiple symptomatic faults would create the same faults with possibly the same results. Consider two symptomatic faults, A and B, shown in FIG. 7. Both faults enter the system at the same time and fault handlers for both faults A and B attempt to create the same fault C. The engine can handle this scenario in several different ways shown in FIGS. 7 and 7A. A simple implementation would create two copies of fault C, one for each fault tree. Each fault tree would only be aware of its copy and, thus, each copy of fault C would be diagnosed separately. This is depicted in FIG. 7 with C′ being the copy. The above approach may result in tests being performed for the same condition twice, which may not be desirable for performance reasons.

[0100] Alternatively, the system creates two copies of fault C but “reuses” the results of test(s). For example, consider that fault C is created first and subsequently tested. A short time later fault C′ is created. Instead of performing the same test again, the engine would use the test result from fault C for fault C′. A drawback to this approach, however, is that, depending on the semantics of the test, a significant amount of time may have passed such that the result computed for fault C may be invalid for fault C′, that is, the result for C is now “stale”. To alleviate this issue, the system may employ certain rules or heuristics to determine when and if a test result can be reused. The system may only reuse the results from certain tests or may only reuse a result depending on its state. For example, using the test result states defined above in the preferred implementation, a NO_PROBLEM result may always be re-tested but a PROBLEM or UNKNOWN result may be reused regardless of the time elapsed. The engine may also “age” test results. For example, if fault C′ occurs within a certain amount of time after fault C as determined by some heuristic “aging” factor, then the result for C can be used. Otherwise fault C′ is re-tested. An “aging” factor may be defined system-wide or an “aging” factor may be specified per fault type or per test. A system implementation may utilize only one set of rules or heuristics for test result reuse or may use a combination of approaches.
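
The “aging” heuristic above may be sketched, for illustration only, as a comparison between the age of an earlier result and a maximum age that is either system-wide or overridden per fault type. The constant names, the numeric values and the function can_reuse are all hypothetical; the description does not specify any particular aging factor or time unit.

```python
# Hypothetical aging factors, in seconds; the values are placeholders.
DEFAULT_MAX_AGE = 60.0                              # system-wide factor
MAX_AGE_BY_FAULT_TYPE = {"FullContactLost": 10.0}   # per-fault-type override


def can_reuse(fault_type, tested_at, now):
    """Return True if the earlier result is still fresh enough to reuse.

    `tested_at` is when fault C was tested and `now` is when fault C'
    is created, both as seconds on the same clock.
    """
    max_age = MAX_AGE_BY_FAULT_TYPE.get(fault_type, DEFAULT_MAX_AGE)
    return (now - tested_at) <= max_age
```

A state-based rule, such as always re-testing a NO_PROBLEM result, could be combined with this check, as the description permits a combination of approaches.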

[0101] FIG. 7A depicts another approach to handling identical fault tree portions or sub-trees, which is to share sub-trees amongst multiple symptomatic faults. In this embodiment, instead of multiple fault super-trees maintaining their own copy of a particular sub-tree, the super-trees would “intersect” at certain faults and share a common sub-tree. Therefore fault C is shared by faults A and B.

[0102] Thus, faults A and B would share fault C and its associated test result(s). A similar issue exists, however, regarding “stale” test results, as described above. Similar rules or heuristics can be applied here as well. If fault B intersects with fault C some time after fault A, these rules can be applied to determine if fault C needs to be re-tested. FIG. 7B illustrates a more complete example using the test result states defined above in the preferred implementation. In the diagram of FIG. 7B, faults A and F intersect at fault C and share its PROBLEM test result. The root cause of both A and F is fault D.
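
One way to obtain the sharing described above is for the repository to return an existing fault instead of creating a duplicate. The sketch below is illustrative only: the class SharingRepository, the method get_or_create, the use of a (fault type, target) pair as the identity of a fault, and the target name "host-1" are all assumptions, since the description does not state how two faults are recognized as the same.

```python
class SharingRepository:
    """Hypothetical repository that shares identical faults between trees."""

    def __init__(self):
        self._faults = {}   # (fault type, target) -> fault record
        self.parents = {}   # (fault type, target) -> list of parent faults

    def get_or_create(self, fault_type, target, parent):
        """Return (fault, created); reuse the fault if it already exists."""
        key = (fault_type, target)
        created = key not in self._faults
        fault = self._faults.setdefault(
            key, {"type": fault_type, "target": target, "results": []})
        # Every requesting parent is linked to the single shared fault.
        self.parents.setdefault(key, []).append(parent)
        return fault, created


repo = SharingRepository()
c_for_a, created_first = repo.get_or_create("C", "host-1", parent="A")
c_for_b, created_second = repo.get_or_create("C", "host-1", parent="B")
```

Both symptomatic faults end up pointing at the same record, so a test result stored on it is visible from either tree, subject to the staleness rules discussed above.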

[0103] According to another important aspect, fault management and diagnosis system 100 enables resolution and possible re-evaluation of a previously detected problem. The system reacts when a monitoring system detects that a problem has been resolved or no longer exists. The system may remove faults and notify presentation 70 of the resolution. This process can be automated and/or require user intervention. If a user instructs the system that a fault has been resolved, the system could merely remove the fault or could choose to re-evaluate the condition represented by the fault and verify resolution.

[0104] The fault tree hierarchy can facilitate resolution and re-evaluation of faults. The system provides a mechanism allowing an observer or detector to specify that a fault originally entered into the system as a problem has subsequently been resolved. Additionally, internal handlers in the system that detected a problem may monitor the condition to detect resolution. When a fault is deemed resolved, the engine would re-test all the faults, if any, in the sub-tree of the resolved fault and propagate the new computed result “up” to its parent fault. The parent fault would then re-test and propagate its new result. This process continues until the entire tree has been re-evaluated. This may result in the entire tree being resolved or in the isolation of a new cause for the root symptomatic fault if it is determined that the root symptom is still a problem even though the original root cause has been resolved.

[0105] As shown in FIG. 6C, the system diagnosed fault F as the root cause of symptomatic fault A. Assume the fault handler that tested fault F is monitoring the condition and, some time later, detects that the problem no longer exists. A new test result is asserted for fault F with a state of NO_PROBLEM that contradicts the original PROBLEM result shown in FIG. 6C. Fault diagnosis engine 101 then makes an “inference” on the parent fault of fault F with the new result state of fault F. An inference on a fault is not an assertion of a new result but a suggestion used to determine if the fault needs to be re-evaluated. If the inferred result state contradicts the current result state for the fault, the fault is re-evaluated. In this case, the engine infers a result state of NO_PROBLEM on the parent fault of F, fault C. Since this inference contradicts the current result state, PROBLEM, fault C is re-evaluated. Fault diagnosis engine 101 transitions the processing state of fault C from completed state 186 back to testing state 184 (shown in FIG. 5).

[0106] Fault diagnosis engine 101 re-triggers diagnoser fault handler 154 for fault C (FIG. 8), which in turn attempts to re-create fault F and fault G. However, since faults F and G already exist in the current tree, no new faults are created. The engine then infers NO_PROBLEM on faults F and G as shown in FIG. 8A. There is a contradiction for fault G, since its current result state is UNKNOWN, so it is transitioned back to testing state 184, but fault F remains in completed state 186 since no contradiction occurs and a new result was just asserted for it. Assume the tester fault handler for fault G finds that G has been resolved and asserts the NO_PROBLEM result state. Diagnosis for fault G is complete so its processing state is changed to completed state 186. The composite result computed for fault C would now be NO_PROBLEM and the CausedBy association between C and F is removed, as shown in FIG. 8A.
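
The inference step used in the re-evaluation above can be sketched as a comparison between the inferred and the current result state. The function infer and the dictionary representation of a fault are hypothetical; the result states of faults C, F and G are those given in the example, taken at the moment each inference is made.

```python
def infer(fault, inferred_state):
    """Apply an inferred result state; re-open the fault on contradiction.

    An inference does not assert a new result. It only decides whether
    the fault is moved back to the testing state for re-evaluation.
    """
    if fault["result"] == inferred_state:
        return False  # no contradiction, the fault stays as it is
    fault["processing_state"] = "testing"
    return True


fault_c = {"result": "PROBLEM", "processing_state": "completed"}
fault_f = {"result": "NO_PROBLEM", "processing_state": "completed"}
fault_g = {"result": "UNKNOWN", "processing_state": "completed"}

reopened = {name: infer(f, "NO_PROBLEM")
            for name, f in [("C", fault_c), ("F", fault_f), ("G", fault_g)]}
```

Faults C and G are re-opened because NO_PROBLEM contradicts their current states, while fault F, whose NO_PROBLEM result was just asserted, remains completed.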

[0107] The new result state for fault C is now propagated to fault A, which causes A to transition back to testing state 184. However, since both child faults B and C are NO_PROBLEM, the composite result for fault A now becomes NO_PROBLEM and the causality association between faults A and C is removed, as shown in FIG. 8B.

[0108] The entire fault tree, shown in FIG. 8B, was re-evaluated except for the sub-tree rooted at fault B, which was not re-evaluated because its original test result state was NO_PROBLEM and no contradiction occurred (see FIG. 6C). If the engine employs an aging factor for test results, as described above, then fault B could be re-evaluated if its result is deemed stale.

[0109] In the embodiment performing aging factor testing, fault diagnosis engine 101 re-evaluates fault results deemed stale. Specifically, as shown in FIG. 8B, fault diagnosis engine 101 re-evaluates fault B's sub-tree. Assume that fault E is re-tested and the result is NO_PROBLEM, but fault D re-tests with a new result of PROBLEM. The composite result for fault B is now also PROBLEM and a causality association is created between faults B and D (FIG. 8C). Fault C is also re-evaluated. Fault diagnosis engine 101 generates a composite result for fault A according to the above rules. Referring to FIG. 8D, based on results of faults B and C, fault A is PROBLEM as determined in the fault tree of FIG. 6C, but now its root cause is fault D.

[0110] The above-described fault diagnosis and management process may be applied to any system. In the following example, the above-described fault diagnosis and management process is applied to diagnosing a fault that occurred in a communications network 210 shown in FIG. 9. Communications network 210 includes a router R1 connecting a subnet 212 and a subnet 220, and includes a router R2 connecting subnet 220 and a subnet 225. Subnet 212 includes a bridge B1, a workstation 214, and a server 216. Subnet 220 includes a bridge B2 and a workstation 222, and subnet 225 includes a bridge B3 and an HTTP server 228. Workstation 214 includes a client agent with IP address 10.253.100.10. The client agent is monitoring HTTP server 228 with IP address 10.253.102.15 by periodically requesting an HTTP web page. The fault diagnosis and management process is running on workstation 222 at 10.253.2.104. A DNS server resides on machine 216 having IP address 10.253.100.1.

[0111] For example, the client agent on workstation 214 tries to load an HTML page with the URL “www.acme.com/index.html” from HTTP server 228, but receives no response. The client agent, acting as fault detector 130, sends information about the failure to fault object factory 110. Fault object factory 110 creates a fault object of type HTTPLostFault from the information provided and its processing state is set to the detected state 182 (FIG. 5).

[0112] As shown in FIG. 4, the fault diagnosis engine compiles a list of fault handlers (step 172) registered for the HTTPLostFault type. In step 176, diagnoser fault handler 154 registered for the HTTPLostFault type is triggered. The diagnoser fault handler checks for two possible causes of the problem: a DNS service failure and an HTTP service failure. The DNS is checked because name resolution may be needed on the client side to determine the IP address of the HTTP server. Thus, the diagnoser creates two faults: one of type DNSServiceFailure fault 234 and the other of type HTTPServiceFailure fault 238, as shown in FIG. 10.

[0113] For DNSServiceFailure fault 234, diagnoser fault handler 154 creates two faults: one of type DNSServerDown fault 235 and the other of type FullContactLost fault 236. Similarly, diagnoser fault handler 154 for the HTTPServiceFailure fault 238 creates two faults: one of type HTTPServerDown fault 239 and the other of type FullContactLost fault 240, as shown in FIG. 10.

[0114] For the HTTPServerDown fault 239, tester fault handler 152 makes a query to an agent running on HTTP server machine 228 at IP 10.253.102.15 to verify that the HTTP server process is running. Assume HTTP server 228 is running, so the result state of the test is set to NO_PROBLEM. The tester fault handler for the FullContactLost fault 240, child of the HTTPServiceFailure fault 238, performs an ICMP echo request (or ping) from the management station 222 at 10.253.2.104 to HTTP server machine 228 at 10.253.102.15. Assume a response is received from the ping; thus the result state of this test is also NO_PROBLEM. At this point, fault diagnosis engine 101 has verified that the management station can contact the HTTP server machine 228 and the HTTP service is available.

[0115] Similarly, tester fault handler 152 for the DNSServerDown fault 235 makes a query to the agent on DNS server machine 216 at IP 10.253.100.1 to verify that the DNS server process is running. In this case, for example, no response is received from the agent on machine 216. As shown in FIG. 10A, tester fault handler 152 asserts a test result with state UNKNOWN on DNSServerDown fault 235 because it could not be positively determined whether the DNS service is available. The server process could be running, but a network failure may be preventing access from the management station. Next, the FullContactLost fault tester handler 152 is also triggered for the other FullContactLost fault 236, from source IP 10.253.2.104 to destination IP 10.253.100.1, the DNS server 216. Assume the DNS server 216 is down, so this ping request fails and no response is received. Tester fault handler 152 asserts a result with state PROBLEM, as shown by 236P in FIG. 10A.
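
The four tester outcomes of this example can be reproduced with a small sketch. The function names `check_server_down` and `check_full_contact_lost` and the probe callables are assumptions made for illustration. The probes are stubbed with the outcomes assumed in the text, so no agent query or ICMP echo request is actually sent.

```python
from enum import Enum


class Result(Enum):
    PROBLEM = "PROBLEM"
    NO_PROBLEM = "NO_PROBLEM"
    UNKNOWN = "UNKNOWN"


def check_server_down(query_agent) -> Result:
    """Hypothetical tester for a ServerDown fault.

    query_agent() returns True or False for "process is running",
    or None when the agent on the machine does not answer.
    """
    running = query_agent()
    if running is None:
        # A stopped process cannot be told apart from an unreachable machine.
        return Result.UNKNOWN
    return Result.NO_PROBLEM if running else Result.PROBLEM


def check_full_contact_lost(ping) -> Result:
    """Hypothetical tester for FullContactLost; ping() is True when a reply arrives."""
    return Result.NO_PROBLEM if ping() else Result.PROBLEM


# Probe outcomes stubbed with the assumptions made in the example.
results = {
    "HTTPServerDown 239": check_server_down(lambda: True),         # agent: process running
    "FullContactLost 240": check_full_contact_lost(lambda: True),  # ping answered
    "DNSServerDown 235": check_server_down(lambda: None),          # agent did not answer
    "FullContactLost 236": check_full_contact_lost(lambda: False), # ping failed
}
for name, result in results.items():
    print(name, result.value)
```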

[0116] As shown in FIG. 10B, according to the rules stated above, the composite result state for the HTTPServiceFailure fault 238NP is NO_PROBLEM, and the composite result state for the DNSServiceFailure fault 234P is PROBLEM. A CausedBy association is created between the DNSServiceFailure fault 234P and FullContactLost fault 236P.

[0117] A composite result state of PROBLEM is computed for the root HTTPLostFault, which agrees with the original PROBLEM assertion, and a CausedBy association is created between the HTTPLost fault 232P and the DNSServiceFailure fault 234P. Thus, the root cause of the HTTP request failure is that the DNS server machine is down, preventing name resolution of the HTTP server machine.
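
The propagation of result states up the fault tree can be sketched as follows. The combination rule used here is an assumption chosen to be consistent with the outcome of this example, not a restatement of the rules referred to above: a fault is PROBLEM if any child is PROBLEM, NO_PROBLEM if all children are NO_PROBLEM, and UNKNOWN otherwise. The names `Fault`, `composite` and `root_causes` are likewise hypothetical.

```python
from dataclasses import dataclass, field

PROBLEM, NO_PROBLEM, UNKNOWN = "PROBLEM", "NO_PROBLEM", "UNKNOWN"


@dataclass
class Fault:
    name: str
    result: str = UNKNOWN
    maybe_caused_by: list = field(default_factory=list)
    caused_by: list = field(default_factory=list)


def composite(fault: Fault) -> str:
    """Compute the composite result state bottom-up (assumed rule, see lead-in)."""
    if not fault.maybe_caused_by:
        return fault.result  # leaf: result asserted by a tester fault handler
    child_results = [composite(child) for child in fault.maybe_caused_by]
    # A CausedBy association is created to each child found to be a PROBLEM.
    fault.caused_by = [c for c, r in zip(fault.maybe_caused_by, child_results)
                       if r == PROBLEM]
    if fault.caused_by:
        fault.result = PROBLEM
    elif all(r == NO_PROBLEM for r in child_results):
        fault.result = NO_PROBLEM
    else:
        fault.result = UNKNOWN
    return fault.result


def root_causes(fault: Fault) -> list:
    """Follow CausedBy associations down to the faults that have none."""
    if not fault.caused_by:
        return [fault.name]
    return [name for child in fault.caused_by for name in root_causes(child)]


# Tester results from the example (FIG. 10A).
dns = Fault("DNSServiceFailure 234", maybe_caused_by=[
    Fault("DNSServerDown 235", UNKNOWN), Fault("FullContactLost 236", PROBLEM)])
http = Fault("HTTPServiceFailure 238", maybe_caused_by=[
    Fault("HTTPServerDown 239", NO_PROBLEM), Fault("FullContactLost 240", NO_PROBLEM)])
root = Fault("HTTPLostFault 232", PROBLEM, [dns, http])

print(composite(root), dns.result, http.result)  # prints "PROBLEM PROBLEM NO_PROBLEM"
print(root_causes(root))                         # prints "['FullContactLost 236']"
```

Following the CausedBy associations from the root leads to FullContactLost fault 236, that is, loss of contact with the DNS server machine.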

[0118] Additional, more complex diagnosis can be performed to check for other possible causes, such as a bad URL, configuration problems on the client side such as invalid TCP/IP configuration for DNS and the default gateway, and hardware problems on the server side such as a hard disk failure. Diagnoser fault handler 154 can be implemented for the FullContactLost fault to trace a path between the source and destination and create faults. Such a path-dependent test tests the network devices and ports that the data traverses, at both Layer 2 and Layer 3, to further isolate a problem, as is described in PCT application (docket No. A3-03WO) entitled “Systems and Methods for Diagnosing Faults in Computer Networks” filed on May 7, 2001, which is incorporated by reference.

[0119] Preferably, the FullContactLost faults 236 and 240 contain the client's IP address as the source instead of the management station's, as in the above example. Thus, the tested path is from the client's perspective. Also, in the above example, the HTTPLost fault 232 was detected by a software agent monitoring the HTTP server. Such a fault could also be injected into the fault diagnosis system via a Help Desk by a user experiencing a problem, as described in the co-pending PCT application (docket No. A3-05WO) entitled “Help Desk Systems and Methods for use with Communications Networks” filed on May 7, 2001, which is incorporated by reference.

[0120] Numerous other embodiments not described in detail here can apply the principles described to particular applications and are within the scope of the claims.

What is claimed is:
 1. A method of diagnosing a fault, comprising the acts of: receiving fault data; creating a fault object; and performing a root cause analysis on said object to determine a root cause.
 2. The method of claim 1 wherein said creating a fault object includes employing a fault object factory using fault data.
 3. The method of claim 1 wherein said performing root cause analysis includes invoking specific fault handlers.
 4. The method of claim 3 wherein said employing fault handlers includes employing a diagnoser fault handler.
 5. The method of claim 3 wherein said employing fault handlers includes employing a tester fault handler.
 6. The method of claim 4 or 5 wherein said employing fault handler includes obtaining an ordered list of fault handlers for a specified transition state of said fault object.
 7. The method of claim 4 wherein said obtaining the ordered list includes employing a diagnoser fault handler registered for the type of the analyzed object.
 8. The method of claim 5 wherein said employing diagnoser fault handler includes transitioning fault object between processing states.
 9. The method of claim 5 further including determining causality.
 10. The method of claim 6 further including performing resolution and re-evaluation of fault objects.
 11. The method of claim 6 wherein said employing said diagnoser fault handler includes decomposing said fault object into at least two constituent fault objects, wherein each said constituent fault object represents a possible cause of said received fault data.
 12. The method of claim 11 further including employing a tester fault handler on each said constituent fault object.
 13. The method of claim 12 including employing a state transition diagram.
 14. The method of claim 13 wherein said employing the transition diagram includes using an initial state, a testing state, a detected state, and a completed state.
 15. The method of claim 11 further including employing a fault object tree.
 16. The method of claim 15 wherein fault objects in said fault object tree are related by a MaybeCausedBy association.
 17. The method of claim 16 wherein fault objects in said fault object tree are related by a CausedBy association.
 18. The method of claim 1 further including impact analysis of said determined root cause on a system from which said fault data was obtained.
 22. The method of claim 1 further including prioritization.
 23. The method of claim 1 further including fault presentation that displays fault result to a user.
 24. The method of claim 1 further including recourse that provides a way for a user to a system from which said fault data were obtained.
 25. A system for analyzing a fault, comprising: a fault object factory constructed and arranged to receive fault data and create a fault object; and a fault diagnosis engine constructed and arranged to perform root cause analysis of said fault object.
 26. A system of claim 25 further including a fault detector constructed and arranged to detect said fault data in a monitored entity.
 27. A system of claim 25 further including a fault repository constructed and arranged to store and access said fault object.
 28. A system of claim 25 further including a fault handler constructed and arranged to be triggered by said fault diagnosis engine to analyze said fault object.
 29. A system of claim 28 wherein said fault handler includes a fault handler tester.
 30. A system of claim 28 wherein said fault handler includes a fault handler diagnoser.